Previously, in Section 9.2, when dealing with linear supervised learning, we saw how normalizing each input feature of a dataset significantly aids parameter tuning by improving the shape of a cost function's contours (making them more 'circular'). Another way of saying this is that we normalized every distribution that touches a model parameter, which in the linear case consists of the distribution of each input feature. The intuition that normalizing parameter-touching distributions aids effective parameter tuning carries over completely from the linear learning scenario to our current one, where we conduct nonlinear learning via multi-layer perceptrons. The difference is that we now have many more parameters than in the linear case, and many of these parameters are internal rather than weights of a linear combination alone. Nonetheless each parameter, as we detail here, touches a distribution that when normalized tends to improve optimization speed, particularly when using first order methods (those detailed in Chapter 3).
Specifically, as we will investigate here in the context of the multi-layer perceptron, to completely carry over the idea of input normalization to our current scenario we will need to standard normalize the output of each and every network activation. Moreover, since these activation distributions naturally change during parameter tuning (e.g., whenever a gradient descent step is made), we must normalize these internal distributions every time we make a parameter update. This leads to the incorporation of a normalization step grafted directly onto the architecture of the multi-layer perceptron itself, invoked every time the weights are changed. This natural extension of input normalization is popularly referred to as batch normalization.
In Section 9.2 we described standard normalization, a simple technique for normalizing the input to a linear model that makes minimizing cost functions involving linear models considerably easier. With our generic linear model
\begin{equation} \text{model}\left(\mathbf{x},\mathbf{w}\right) = w_0 + x_1w_1 + \cdots + x_Nw_N \end{equation}
standard normalization involves normalizing the distribution of each input dimension of a dataset of $P$ points; e.g., for the $n^{th}$ dimension we make the substitution
\begin{equation} x_{p,n} \longleftarrow \frac{x_{p,n} - \mu_{n}}{\sigma_{n}} \end{equation}
for each input, where $\mu_n$ and $\sigma_n$ are the mean and standard deviation along the $n^{th}$ feature of the input, respectively. Importantly, note that we did not perform any kind of normalization on the constant values touching the bias weight $w_0$.
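The substitution above can be sketched in a few lines of NumPy. Note the function name `standard_normalize` and the toy dataset are illustrative choices of ours, not from the text.

```python
import numpy as np

def standard_normalize(X):
    """Standard normalize each feature (column) of a dataset.

    X is a P x N array of P input points; each column is shifted by its
    mean mu_n and divided by its standard deviation sigma_n, as in the
    substitution above.
    """
    mu = X.mean(axis=0)       # per-feature means mu_n
    sigma = X.std(axis=0)     # per-feature standard deviations sigma_n
    return (X - mu) / sigma

# toy dataset of P = 4 points with N = 2 features
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])
X_normed = standard_normalize(X)
# each feature of X_normed now has mean ~0 and standard deviation ~1
```

After this substitution every input dimension has zero mean and unit standard deviation, which is precisely what improves the cost function's contours.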
Batch normalization takes this basic standard normalization concept and naturally extends it to models employing multi-layer perceptrons, providing analogous benefits when optimizing cost functions employing such models. Suppose we take our generic model involving $B = U_{L}$ generic $L$ layer multi-layer perceptron units $f_1^{(L)},\,f_2^{(L)},\,...,f_{U_L}^{(L)}$ as described in the previous Section
\begin{equation} \text{model}\left(\mathbf{x},\Theta\right) = w_0^{\,} + f_1^{(L)}\left(\mathbf{x}\right)w_1^{\,} + \cdots + f_{U_L}^{(L)}\left(\mathbf{x}\right)w_{U_L}^{\,} \end{equation}
and try to extend the standard normalization scheme introduced in Section 9.2 to every non-bias weight-touching distribution of our model. Of course here the input features, or dimensions of our input, no longer touch the weights of the linear combination $w_1,...,w_{U_L}$; they instead touch the weights internal to the first layer of the multi-layer perceptron units themselves. We can see this by examining the $j^{th}$ single layer unit in this network
\begin{equation} f^{(1)}_j\left(\mathbf{x}\right)=a\left(w^{\left(1\right)}_{0,\,j}+\underset{n=1}{\overset{N}{\sum}}{w^{\left(1\right)}_{n,\,j}\,x^{\,}_n}\right) \end{equation}
where we see that the $n^{th}$ dimension of the input $x_n$ touches the single layer weight $w^{\left(1\right)}_{n,\,j}$ (and not, e.g., a weight of the final linear combination). Thus in standard normalizing the input we only directly affect the contours of a cost function employing a multi-layer perceptron along the weights internal to the single layer units. To affect the contours of a cost function with respect to weights external to the first hidden layer (via standard normalizing) we must naturally standard normalize the output of the first hidden layer.
For the sake of simplicity, suppose for a moment $L=1$, and thus our model takes the form $\text{model}\left(\mathbf{x},\Theta\right) = w_0^{\,} + f_1^{(1)}\left(\mathbf{x}\right)w_1^{\,} + \cdots + f_{U_1}^{(1)}\left(\mathbf{x}\right)w_{U_1}^{\,}$ and so the distribution $\left\{f^{(1)}_j\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ directly touches the final linear combination weight $w_j$. In our quest to fully apply the principle of standard normalization to our multi-layer perceptron model we would then naturally want to standard normalize the output of our first hidden layer (i.e., for each unit $j =1,...,U_1$) as
\begin{equation} f_j^{(1)} \left(\mathbf{x} \right) \longleftarrow \frac{f_j^{(1)} \left(\mathbf{x}\right) - \mu_{f_j^{(1)}}}{\sigma_{f_j^{(1)}}} \end{equation}
where
\begin{equation} \begin{array}{l} \mu_{f_j^{(1)}} = \frac{1}{P}\sum_{p=1}^{P}f_j^{(1)}\left(\mathbf{x}_p \right) \\ \sigma_{f_j^{(1)}} = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f_j^{(1)}\left(\mathbf{x}_p \right) - \mu_{f_j^{(1)}} \right)^2}. \end{array} \end{equation}
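This per-unit substitution can be sketched as follows for a single hidden layer with $\tanh$ activations; the names `hidden_layer`, `normalize_activations`, `W1`, and `b1`, as well as the random toy data, are our own illustrative assumptions.

```python
import numpy as np

def hidden_layer(X, W1, b1):
    """Single hidden layer f^(1) with tanh activation.

    X: P x N input data, W1: N x U_1 internal weights, b1: U_1 biases.
    Returns a P x U_1 array of activation outputs f_j^(1)(x_p).
    """
    return np.tanh(X @ W1 + b1)

def normalize_activations(F):
    """Standard normalize each unit's activation distribution: for each
    unit j, shift by mu_{f_j} and divide by sigma_{f_j} over all P points."""
    mu = F.mean(axis=0)       # mu_{f_j^(1)} for each unit j
    sigma = F.std(axis=0)     # sigma_{f_j^(1)} for each unit j
    return (F - mu) / sigma

# illustrative data: P = 50 points, N = 3 features, U_1 = 2 units
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
W1 = rng.normal(size=(3, 2))
b1 = rng.normal(size=2)

F = hidden_layer(X, W1, b1)
F_normed = normalize_activations(F)
# each unit's output distribution now has mean ~0 and std ~1
```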
This is certainly easy to accomplish. However, it is important to note that, unlike our input features, the output of the single hidden layer (and hence its distribution) changes every time the internal parameters of our model change, e.g., during each step of gradient descent. The constant alteration of this distribution of single layer units, called internal covariate shift (or just covariate shift) in the jargon of machine learning, implies that if we are to carry over the principle of standard normalization completely we will need to standard normalize the output of the first hidden layer at every step of parameter tuning (e.g., at every gradient descent step). In other words, we need to build standard normalization directly into the perceptron architecture itself.
We show a generic recipe for doing just this - a simple extension of the recipe for single layer units given in the previous Section - below.
1: input: Activation function $a\left(\cdot\right)$ and input data $\left\{\mathbf{x}_p\right\}_{p=1}^P$
2: Compute linear combination: $v = w_{0}^{(1)}+\sum_{n=1}^{N}{w_{n}^{(1)}\,x_n}$
3: Pass result through activation: $f^{(1)}\left(\mathbf{x}\right) = a\left(v\right)$
4: Compute mean $\mu_{f^{(1)}}$ and standard deviation $\sigma_{f^{(1)}}$ of $\left\{f^{(1)}\left(\mathbf{x}_p\right) \right\}_{p=1}^P$
5: Standard normalize: $f^{(1)} \left(\mathbf{x} \right) \longleftarrow \frac{f^{(1)} \left(\mathbf{x}\right) - \mu_{f^{(1)}}}{\sigma_{f^{(1)}}}$
6: output: Batch normalized single layer unit $f^{(1)} \left(\mathbf{x} \right)$
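A minimal sketch of this recipe in NumPy, for a single unit evaluated over the whole dataset at once; the function and argument names are our own, not from the text.

```python
import numpy as np

def batch_normalized_unit(X, w0, w, a=np.tanh):
    """Batch normalized single layer unit, following the recipe above.

    X: P x N input data; w0: internal bias; w: length-N internal weight
    vector; a: activation function. Because the output distribution
    depends on the current weights, this must be re-evaluated every time
    w0 or w is updated (e.g., after each gradient descent step).
    """
    v = w0 + X @ w                  # step 2: linear combination for all P points
    f = a(v)                        # step 3: pass result through activation
    mu, sigma = f.mean(), f.std()   # step 4: mean / standard deviation of outputs
    return (f - mu) / sigma         # step 5: standard normalize

# usage: P = 100 random points in N = 2 dimensions, random weights
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
f_normed = batch_normalized_unit(X, w0=0.5, w=rng.normal(size=2))
# f_normed has mean ~0 and standard deviation ~1 regardless of the weights
```

Since the normalization statistics are recomputed inside the unit itself, re-running the forward pass after any weight update automatically keeps the activation output distribution standard normalized.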
In this example we illustrate the covariate shift of a single layer perceptron model using two ReLU units $f^{(1)}_1$ and $f^{(1)}_2$, applied to the toy two-class classification dataset shown below.
We now run $5,000$ steps of gradient descent to minimize the softmax cost using this single layer network, where we standard normalize the input data. We use a set of random weights for the network loaded in from memory.
Below we show an animation of this gradient descent run, plotting the single unit distribution $\left\{f^{(1)}_1\left(\mathbf{x}_p\right),\,f^{(1)}_2\left(\mathbf{x}_p\right) \right\}_{p=1}^P$ at a subset of the steps taken during the run. In the left panel we show this covariate shift or activation output distribution at the $k^{th}$ step of the optimization, while the right panel shows the complete cost function history curve where the current step of the animated optimization is marked on the curve with a red dot. Moving the slider from left to right progresses the run from start to finish.
As you can see by moving the slider around, the distributions of activation outputs, i.e., the distributions touching the weights $w_1$ and $w_2$ of our model's linear combination, change dramatically as the gradient descent algorithm progresses. We can intuit (from our previous discussions of input normalization) that this sort of shifting distribution negatively affects the speed at which gradient descent can properly minimize our cost function.
Now we repeat the above experiment using the batch normalized single layer perceptron, making a run of $10,000$ gradient descent steps from the same initialization used above. We then animate the covariate shift / distribution of activation outputs using the same animation tool. Moving the slider below from left to right, progressing the algorithm, we can see that the distribution of activation outputs remains considerably more stable.